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Abstract 

Sparse coding consists in representing signals as sparse linear combinations of atoms selected from 
a dictionary. We consider an extension of this framework where the atoms are further assumed to 
be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regu- 
larization norm, which has proven useful in several applications. This norm leads to regularized 
problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving 
them. More precisely, we show that the proximal operator associated with this norm is computable 
exactly via a dual approach that can be viewed as the composition of elementary proximal opera- 
tors. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows 
the use of accelerated gradient techniques to solve the tree-structured sparse approximation prob- 
lem at the same computational cost as traditional ones using the ^i-norm. Our method is efficient 
and scales gracefully to millions of variables, which we illustrate in two types of applications: 
first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we 
apply our optimization tools in the context of dictionary learning, where learned dictionary ele- 
ments naturally self-organize in a prespecified arborescent structure, leading to better performance 
in reconstruction of natural image patches. When applied to text documents, our method learns 
hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. 
Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization. 



1. Introduction 

Modeling signals as sparse linear combinations of atoms selected from a dictionary has become 
a popular paradigm in many fields, including signal processing, statistics, and machine learning. 
This line of research, also known as sparse coding, has witnessed the development of several well- 
founded theoretical frameworks (Tibshirani, 1996; Chen et al, 1998; Mallat, 1999; Tropp, 2004, 
2006; Wainwright, 2009; Bickel et al., 2009) and the emergence of many efficient algorithmic 
tools (Efron et al., 2004; Nesterov, 2007; Beck and Teboulle, 2009; Wright et al., 2009; Needell 
and Tropp, 2009; Yuan et al., 2010). 
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In many applied settings, the structure of the problem at hand, such as, e.g., the spatial arrange- 
ment of the pixels in an image, or the presence of variables corresponding to several levels of a given 
factor, induces relationships between dictionary elements. It is appealing to use this a priori knowl- 
edge about the problem directly to constrain the possible sparsity patterns. For instance, when the 
dictionary elements are partitioned into predefined groups corresponding to different types of fea- 
tures, one can enforce a similar block structure in the sparsity pattern — that is, allow only that either 
all elements of a group are part of the signal decomposition or that all are dismissed simultaneously 
(see Yuan and Lin, 2006; Stojnic et al., 2009). 

This example can be viewed as a particular instance of structured sparsity, which has been 
lately the focus of a large amount of research (Baraniuk et al., 2010; Zhao et al., 2009; Huang et al., 
2009; Jacob et al., 2009; Jenatton et al., 2009; Micchelli et al., 2010). In this paper, we concentrate 
on a specific form of structured sparsity, which we call hierarchical sparse coding: the dictionary 
elements are assumed to be embedded in a directed tree T , and the sparsity patterns are constrained 
to form a connected and rooted subtree of T (Donoho, 1997; Baraniuk, 1999; Baraniuk et al., 2002, 
2010; Zhao et al., 2009; Huang et al., 2009). This setting extends more generally to a forest of 
directed trees. 1 

In fact, such a hierarchical structure arises in many applications. Wavelet decompositions lend 
themselves well to this tree organization because of their multiscale structure, and benefit from it for 
image compression and denoising (Shapiro, 1993; Crouse et al., 1998; Baraniuk, 1999; Baraniuk 
et al., 2002, 2010; He and Carin, 2009; Zhao et al., 2009; Huang et al., 2009). In the same vein, 
edge filters of natural image patches can be represented in an arborescent fashion (Zoran and Weiss, 
2009). Imposing these sparsity patterns has further proven useful in the context of hierarchical 
variable selection, e.g., when applied to kernel methods (Bach, 2008), to log-linear models for the 
selection of potential orders (Schmidt and Murphy, 2010), and to bioinformatics, to exploit the tree 
structure of gene networks for multi-task regression (Kim and Xing, 2010). Hierarchies of latent 
variables, typically used in neural networks and deep learning architectures (see Bengio, 2009, and 
references therein) have also emerged as a natural structure in several applications, notably to model 
text documents. In particular, in the context of topic models (Blei et al., 2003), a hierarchical model 
of latent variables based on Bayesian non-parametric methods has been proposed by Blei et al. 
(2010) to model hierarchies of topics. 

To perform hierarchical sparse coding, our work builds upon the approach of Zhao et al. (2009) 
who first introduced a sparsity-inducing norm Q. leading to this type of tree-structured sparsity pat- 
tern. We tackle the resulting nonsmooth convex optimization problem with proximal methods (e.g., 
Nesterov, 2007; Beck and Teboulle, 2009; Wright et al., 2009; Combettes and Pesquet, 2010) and we 
show in this paper that its key step, the computation of the proximal operator, can be solved exactly 
with a complexity linear, or close to linear, in the number of dictionary elements — that is, with the 
same complexity as for classical i\ -sparse decomposition problems (Tibshirani, 1996; Chen et al., 
1998). Concretely, given an m-dimensional signal x along with a dictionary D = [d 1 , . . . , d p ] S M mxp 
composed of p atoms, the optimization problem at the core of our paper can be written as 

1 , 
min - x-Da a), with A, > 0. 

aeW>2 

In this formulation, the sparsity-inducing norm Q. encodes a hierarchical structure among the atoms 
of D, where this structure is assumed to be known beforehand. The precise meaning of hierarchical 

1. A tree is defined as a connected graph that contains no cycle (see Ahuja et al., 1993). 
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structure and the definition of Q. will be made more formal in the next sections. A particular instance 
of this problem — known as the proximal problem — is central to our analysis and concentrates on 
the case where the dictionary D is orthogonal. 

In addition to a speed benchmark that evaluates the performance of our proposed approach in 
comparison with other convex optimization techniques, two types of applications and experiments 
are considered. First, we consider settings where the dictionary is fixed and given a priori, corre- 
sponding for instance to a basis of wavelets for the denoising of natural images. Second, we show 
how one can take advantage of this hierarchical sparse coding in the context of dictionary learn- 
ing (Olshausen and Field, 1997; Aharon et al., 2006; Mairal et al., 2010a), where the dictionary is 
learned to adapt to the predefined tree structure. This extension of dictionary learning is notably 
shown to share interesting connections with hierarchical probabilistic topic models. 

To summarize, the contributions of this paper are threefold: 

• We show that the proximal operator for a tree-structured sparse regularization can be com- 
puted exactly in a finite number of operations using a dual approach. Our approach is equiva- 
lent to computing a particular sequence of elementary proximal operators, and has a complex- 
ity linear, or close to linear, in the number of variables. Accelerated gradient methods (e.g., 
Nesterov, 2007; Beck and Teboulle, 2009; Combettes and Pesquet, 2010) can then be applied 
to solve large-scale tree-structured sparse decomposition problems at the same computational 
cost as traditional ones using the ^i-norm. 

• We propose to use this regularization scheme to learn dictionaries embedded in a tree, which, 
to the best of our knowledge, has not been done before in the context of structured sparsity. 

• Our method establishes a bridge between hierarchical dictionary learning and hierarchical 
topic models (Blei et al., 2010), which builds upon the interpretation of topic models as 
multinomial PCA (Buntine, 2002), and can learn similar hierarchies of topics. This point 
is discussed in Sections 5.5 and 6. 

Note that this paper extends a shorter version published in the proceedings of the international 
conference of machine learning (Jenatton et al., 2010). 

1.1 Notation 

Vectors are denoted by bold lower case letters and matrices by upper case ones. We define for q > 1 
the ^-norm of a vector x in W" as ||x|| 9 = (X?li jx,-^) 1 / 9 , where x,- denotes the i-th coordinate of x, 
and ||x||oo = max != i |x, | = lim^oo ||x|| 9 . We also define the ^o-pseudo-norm as the number of 
nonzero elements in a vector: 2 ||x||o = #{/ s.t. x,- ^ 0} = lim q ^ + (K=i \ x i\ q )- We consider the 
Frobenius norm of a matrix X in E mx ": ||X||f = (Lfii Lj=i X? ) 1 / 2 , where X i; - denotes the entry 

of X at row i and column j. Finally, for a scalar y, we denote (y) + = max(;y,0). 

The rest of this paper is organized as follows: Section 2 presents related work and the prob- 
lem we consider. Section 3 is devoted to the algorithm we propose, and Section 4 introduces the 
dictionary learning framework and shows how it can be used with tree-structured norms. Section 5 
presents several experiments demonstrating the effectiveness of our approach and Section 6 con- 
cludes the paper. 

2. Note that it would be more proper to write ||x||q instead of ||x||o to be consistent with the traditional notation ||x|| 9 . 
However, for the sake of simplicity, we will keep this notation unchanged in the rest of the paper. 
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2. Problem Statement and Related Work 

Let us consider an input signal of dimension m, typically an image described by its m pixels, which 
we represent by a vector x in W". In traditional sparse coding, we seek to approximate this signal 
by a sparse linear combination of atoms, or dictionary elements, represented here by the columns of 
a matrix D = [d\ . . . ,d*] in R mx '\ This can equivalently be expressed as x« Da for some sparse 
vector a in IR P , i.e, such that the number of nonzero coefficients ||a||o is small compared to p. The 
vector a is referred to as the code, or decomposition, of the signal x. 




Figure 1 : Example of a tree 1 when p = 6. With the rule we consider for the nonzero patterns, if 
we have as / 0, we must also have a^ / for k in ancestors (5) = {1,3,5}. 

In the rest of the paper, we focus on specific sets of nonzero coefficients — or simply, nonzero 
patterns — for the decomposition vector a. In particular, we assume that we are given a tree 3 T 
whose p nodes are indexed by j in {1 , . . . ,p}. We want the nonzero patterns of a to form a connected 
and rooted subtree of T; in other words, if ancestors (j) C {1,...,/?} denotes the set of indices 
corresponding to the ancestors 4 of the node j in T (see Figure 1), the vector a obeys the following 
rule 

a ; / =^> [a* / for all k in ancestors (j)]. (1) 

Informally, we want to exploit the structure of T in the following sense: the decomposition of any 
signal x can involve a dictionary element d ; only if the ancestors ofd J in the tree T are themselves 
part of the decomposition. 

We now review previous work that has considered the sparse approximation problem with tree- 
structured constraints (1). Similarly to traditional sparse coding, there are basically two lines of 
research, that either (A) deal with nonconvex and combinatorial formulations that are in general 
computationally intractable and addressed with greedy algorithms, or (B) concentrate on convex 
relaxations solved with convex programming methods. 

2.1 Nonconvex Approaches 

For a given sparsity level s > (number of nonzero coefficients), the following nonconvex problem 

1„ ll2 

min - x — Da 2 such that condition ( 1 ) is respected , (2) 

CLEW 2 

\\a\\ <s 



3. Our analysis straightforwardly extends to the case of a forest of trees; for simplicity, we consider a single tree T . 

4. We consider that the set of ancestors of a node also contains the node itself. 
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has been tackled by Baraniuk (1999); Baraniuk et al. (2002) in the context of wavelet approxima- 
tions with a greedy procedure. A penalized version of problem (2) (that adds A,||oc||o to the objec- 
tive function in place of the constraint \\cn\\o < s) has been considered by Donoho (1997), while 
studying the more general problem of best approximation from dyadic partitions (see Section 6 in 
Donoho, 1997). Interestingly, the algorithm we introduce in Section 3 shares conceptual links with 
the dynamic-programming approach of Donoho (1997), which was also used by Baraniuk et al. 
(2010), in the sense that the same order of traversal of the tree is used in both procedures. We 
investigate more thoroughly the relations between our algorithm and this approach in Appendix A. 

Problem (2) has been further studied for structured compressive sensing (Baraniuk et al., 2010), 
with a greedy algorithm that builds upon Needell and Tropp (2009). Finally, Huang et al. (2009) 
have proposed a formulation related to (2), with a nonconvex penalty based on an information- 
theoretic criterion. 

2.2 Convex Approach 

We now turn to a convex reformulation of the constraint (1), which is the starting point for the 
convex optimization tools we develop in Section 3. 

2.2.1 Hierarchical Sparsity-Inducing Norms 

Condition (1) can be equivalently expressed by its contrapositive, thus leading to an intuitive way 
of penalizing the vector a to obtain tree-structured nonzero patterns. More precisely, defining 
descendants (j) C {1, . . . ,p} analogously to ancestors(j) for j in {1, .. . ,p}, condition (1) amounts 
to saying that if a dictionary element is not used in the decomposition, its descendants in the tree 
should not be used either. Formally, this can be formulated as: 



From now on, we denote by g the set defined by g = {descendants (j); j £ {1, . . . ,p}}, and refer to 
each member g of g as a group (Figure 2). To obtain a decomposition with the desired property (3), 
one can naturally penalize the number of groups g in Q that are "involved" in the decomposition 
of x, i.e., that record at least one nonzero coefficient of a: 



While this intuitive penalization is nonconvex (and not even continuous), a convex proxy has been 
introduced by Zhao et al. (2009). It was further considered by Bach (2008); Kim and Xing (2010); 
Schmidt and Murphy (2010) in several different contexts. For any vector a G W , let us define 



where 0C| g is the vector of size p whose coordinates are equal to those of a for indices in the set g, 
and to otherwise 5 . The notation |.| stands in practice either for the £2- or ^-norm, and ((s}g) ge g 

5. Note the difference with the notation a g , which is often used in the literature on structured sparsity, where a g is a 
vector of size \g\. 



(Xj = =>• [<Xjt = for all k in descendants (j)]. 



(3) 




1 if there exists j G g such that oc y - 7^ 0, 
otherwise. 



(4) 



n(oc) = £ <o g \\a 
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denotes some positive weights 6 . As analyzed by Zhao et al. (2009) and Jenatton et al. (2009), 
when penalizing by £2, some of the vectors (X| g are set to zero for some g £ Q ? Therefore, the 
components of a corresponding to some complete subtrees of T are set to zero, which exactly 
matches condition (3), as illustrated in Figure 2. 




Figure 2: Left: example of a tree-structured set of groups Q (dashed contours in red), corresponding 
to a tree 1 with p = 6 nodes represented by black circles. Right: example of a sparsity pattern 
induced by the tree-structured norm corresponding to Q: the groups {2,4}, {4} and {6} are set to 
zero, so that the corresponding nodes (in gray) that form subtrees of T are removed. The remaining 
nonzero variables {1,3,5} form a rooted and connected subtree of T . This sparsity pattern obeys 
the following equivalent rules: (i) if a node is selected, the same goes for all its ancestors, (ii) if a 
node is not selected, then its descendant are not selected. 

Note that although we presented for simplicity this hierarchical norm in the context of a single 
tree with a single element at each node, it can easily be extended to the case of forests of trees, 
and/or trees containing arbitrary numbers of dictionary elements at each node (with nodes possibly 
containing no dictionary element). More broadly, this formulation can be extended with the notion 
of tree-structured groups, which we now present: 

Definition 1 (Tree-structured set of groups.) 

A set of groups Q = {g}s>eg is said to be tree-structured in {1, . . . ,p}, if \J gE gg = 0> • • • :P} an d if 
for all g,h£ Q, (gCih 7^ 0) =>• (g C hor h C g). For such a set of groups, there exists a ( non-unique ) 
total order relation H such that: 

g^h => {gQh or gnh = %}. 

Given such a tree-structured set of groups Q and its associated norm Q., we are interested throughout 
the paper in the following hierarchical sparse coding problem, 

min/(a)+A£2(a), (5) 

aeMP 

where Q. is the tree-structured norm we have previously introduced, the non-negative scalar X is a 
regularization parameter controlling the sparsity of the solutions of (5), and / a smooth convex loss 

6. For a complete definition of fl for any ^-norm, a discussion of the choice of q, and a strategy for choosing the 
weights Cflg (see Zhao et al., 2009; Kim and Xing, 2010). 

7. It has been further shown by Bach (2010) that the convex envelope of the nonconvex function of Eq. (4) is in fact Q. 
with ||.|| being the ?oo-norm. 
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function (see Section 3 for more details about the smoothness assumptions on /). In the rest of the 
paper, we will mostly use the square loss /(a) = i ||x — Da|||, with a dictionary D in W" xp , but the 
formulation of Eq. (5) extends beyond this context. In particular one can choose / to be the logistic 
loss, which is commonly used for classification problems (e.g., see Hastie et al., 2009). 

Before turning to optimization methods for the hierarchical sparse coding problem, we consider 
a particular instance. The sparse group Lasso was recently considered by Sprechmann et al. (2010) 
and Friedman et al. (2010) as an extension of the group Lasso of Yuan and Lin (2006). To induce 
sparsity both groupwise and within groups, Sprechmann et al. (2010) and Friedman et al. (2010) 
add an l\ term to the regularization of the group Lasso, which given a partition (P of {1, . . . ,/?} in 
disjoint groups yields a regularized problem of the form 

min -llx — Docllo+A, Y ||oC|J|2 + ^/||oc||i. 
aeR'2 11 ^" 

Since <B is a partition, the set of groups in <£ and the singletons form together a tree-structured set 
of groups according to definition 1 and the algorithm we will develop is therefore applicable to this 
problem. 



2.2.2 Optimization for Hierarchical Sparsity-Inducing Norms 

While generic approaches like interior-point methods (Boyd and Vandenberghe, 2004) and subgra- 
dient descent schemes (Bertsekas, 1999) might be used to deal with the nonsmooth norm £2, several 
dedicated procedures have been proposed. 

In Zhao et al. (2009), a boosting-like technique is used, with a path-following strategy in the 
specific case where ||.|| is the ^-norm. Based on the variational equality 



1 p u 2 

Mli = ™n-[£-i + z;], (6) 
zeR' + I j=l Zj 



Kim and Xing (2010) follow a reweighted least-square scheme that is well adapted to the square 
loss function. To the best of our knowledge, a formulation of this type is however not available 
when |.| is the ^oo-norm. In addition it requires an appropriate smoothing to become provably 
convergent. The same approach is considered by Bach (2008), but built upon an active-set strategy. 
Other proposed methods consist of a projected gradient descent with approximate projections onto 
the ball {u £ W } ; Q.(u) < X} (Schmidt and Murphy, 2010), and an augmented-Lagrangian based 
technique (Sprechmann et al., 2010) for solving a particular case with two-level hierarchies. 

While the previously listed first-order approaches are (1) loss-function dependent, and/or (2) 
not guaranteed to achieve optimal convergence rates, and/or (3) not able to yield sparse solutions 
without a somewhat arbitrary post-processing step, we propose to resort to proximal methods 8 that 
do not suffer from any of these drawbacks. 



3. Optimization 

We begin with a brief introduction to proximal methods, necessary to present our contributions. 
From now on, we assume that / is convex and continuously differentiable with Lipschitz-continuous 

8. Note that the authors of Chen et al. (2010) have considered proximal methods for general group structure Q when 
||.|| is the IS -norm; due to a smoothing of the regularization term, the convergence rate they obtained is suboptimal. 
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gradient. It is worth mentioning that there exist various proximal schemes in the literature that differ 
in their settings (e.g., batch versus stochastic) and/or the assumptions made on /. For instance, the 
material we develop in this paper could also be applied to online/stochastic frameworks (Duchi and 
Singer, 2009; Hu et al., 2009; Xiao, 2010) and to possibly nonsmooth functions / (e.g., Duchi and 
Singer, 2009; Xiao, 2010; Combettes and Pesquet, 2010, and references therein). Finally, most of 
the technical proofs of this section are presented in Appendix B for readability. 

3.1 Proximal Operator for the Norm Q. 

Proximal methods have drawn increasing attention in the signal processing (e.g., Becker et al., 2009; 
Wright et al., 2009; Combettes and Pesquet, 2010, and numerous references therein) and the ma- 
chine learning communities (e.g., Bach et al., 2011, and references therein), especially because of 
their convergence rates (optimal for the class of first-order techniques) and their ability to deal with 
large nonsmooth convex problems (e.g., Nesterov, 2007; Beck and Teboulle, 2009). In a nutshell, 
these methods can be seen as a natural extension of gradient-based techniques when the objective 
function to minimize has a nonsmooth part. Proximal methods are iterative procedures. The sim- 
plest version of this class of methods linearizes at each iteration the function / around the current 
estimate d, and this estimate is updated as the (unique by strong convexity) solution of the proximal 
problem, defined as follows: 



The quadratic term keeps the update in a neighborhood where / is close to its linear approximation, 
and L > is a parameter which is an upper bound on the Lipschitz constant of V/. This problem 
can be equivalently rewritten as: 



Solving efficiently and exactly this problem is crucial to enjoy the fast convergence rates of proximal 
methods. In addition, when the nonsmooth term Q. is not present, the previous proximal problem 
exactly leads to the standard gradient update rule. More generally, we define the proximal operator: 

Definition 2 (Proximal Operator) 

The proximal operator associated with our regularization term TSi, which we denote by Proxy^i, is 
the function that maps a vector u G W to the unique solution of 



This operator was initially introduced by Moreau (1962) to generalize the projection operator onto 
a convex set. What makes proximal methods appealing for solving sparse decomposition problems 
is that this operator can be often computed in closed-form. For instance, 

• When Q. is the ^i-norm — that is, H(u) = ||u||i, the proximal operator is the well-known 
elementwise soft-thresholding operator, 




min /(a) + (a- a) V/(a) + XCl(a) + - ||a - a 




min -llu — vllo + AX2(v). 
yew 2 



(V) 



V/'e {!,..., p}, uy i-^sign(uy)(|uy| 
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• When Q. is a group-Lasso penalty with ^2-norms — that is, £2(u) = Y,geg \\ u \g lb> with Q being 
a partition of {1, . . . ,p}, the proximal problem is separable in every group, and the solution 
is a generalization of the soft-thresholding operator to groups of variables: 

v, G £ ! u l ^u lg -n l| . l|2 <,K] = |° i!; ^ U|g ^"j s 2 e " X 

where ilii n 2 <x denotes the orthogonal projection onto the ball of the ^2-norm of radius A,. 

• When Q. is a group-Lasso penalty with ^-norms — that is, £2(u) = Eggs ll u igll~> the solution 
is also a group-thresholding operator: 

Vge g, uigM-u^-iiy^uig], 

where n^n^x, denotes the orthogonal projection onto the ^-ball of radius A, which can be 
solved in O(p) operations (Brucker, 1984; Maculan and Galdino de Paula, 1989). Note that 
when ||U| g || i < X, we have a group-thresholding effect, with u !g — ri||.|| j<A.[ u ig] = 0- 

More generally, a classical result (see, e.g., Combettes and Pesquet, 2010; Wright et al., 2009) says 
that the proximal operator for a norm |.| can be computed as the residual of the projection of a 
vector onto a ball of the dual-norm denoted by ||.|| # , and defined for any vector K in M p by ||k||* = 
max|| z |i<i z T K. 9 This is a classical duality result for proximal operators leading to the different 
closed forms we have just presented. We have indeed that Prox^ || 2 = Id — Il|| ,\\ 2 <\ and Prox^ ^ = 
Id — ilii ilkx, where Id stands for the identity operator. Obtaining closed forms is, however, not 
possible anymore as soon as some groups in Q overlap, which is always the case in our hierarchical 
setting with tree-structured groups. 



3.2 A Dual Formulation of the Proximal Problem 



We now show that Eq. (7) can be solved using a dual approach, as described in the following 
lemma. The result relies on conic duality (Boyd and Vandenberghe, 2004), and does not make any 
assumption on the choice of the norm || . || : 



Lemma 1 (Dual of the proximal problem) 

Let uGf and let us consider the problem 

1 

max — - 

^gRPxl!?! 2 



Hi 



(8) 



s.t.\/ge§, \\^% < X® g and ^ = 0ifj^ g, 



where £ = (<^ 8 ) ge g and denotes the j-th coordinate of the vector in W\ Then, problems (7) 
and (8) are dual to each other and strong duality holds. In addition, the pair of primal-dual vari- 
ables {v,^} is optimal if and only ifh, is a feasible point of the optimization problem (8), and 



v = u-I ge ^ g and Vge §, ^ = n||.||,<^(v te + ^ 
where we denote by II|| .|| t <Xa> the orthogonal projection onto the ball {k € IR 



(9) 



< A.00 g }. 



9. It is easy to show that the dual norm of the £2- norm is the f2- norm itself. The dual norm of the too is the £i-norm. 
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Note that we focus here on specific tree-structured groups, but the previous lemma is valid regardless 
of the nature of g . The rationale of introducing such a dual formulation is to consider an equiva- 
lent problem to (7) that removes the issue of overlapping groups at the cost of a larger number of 
variables. In Eq. (7), one is indeed looking for a vector v of size p, whereas one is considering a 
matrix £, in R px l^l in Eq. (8) with Y, g eg \g\ nonzero entries, but with separable (convex) constraints 
for each of its columns. 

This specific structure makes it possible to use block coordinate ascent (Bertsekas, 1999). Such 
a procedure is presented in Algorithm 1. It optimizes sequentially Eq. (8) with respect to the vari- 
able If, while keeping fixed the other variables if, for h ^ g. It is easy to see from Eq. (8) that such 
an update of a column If, for a group g in Q , amounts to computing the orthogonal projection of 
the vector U| g — Y*h£g^\g onto trie ball of radius A-GO^ of the dual norm ||.||*. 

Algorithm 1 Block coordinate ascent in the dual 

Inputs: uGK p and set of groups Q . 
Outputs: iyff) (primal-dual solutions). 
Initialization: ^ = 0. 

while ( maximum number of iterations not reached ) do 
for g G Q do 

end for 
end while 



3.3 Convergence in One Pass 

In general, Algorithm 1 is not guaranteed to solve exactly Eq. (7) in a finite number of iterations. 
However, when |.| is the li- or £oo-norm, and provided that the groups in Q are appropriately or- 
dered, we now prove that only one pass of Algorithm 1, i.e., only one iteration over all groups, is 
sufficient to obtain the exact solution of Eq. (7). This result constitutes the main technical contribu- 
tion of the paper and is the key for the efficiency of our procedure. 

Before stating this result, we need to introduce a lemma showing that, given two nested groups 
g,h such that g C h C {1,.. . ,p}, if If is updated before if 1 in Algorithm 1, then the optimality 
condition for if is not perturbed by the update of if . 

Lemma 2 (Projections with nested groups) 

Let |.| denote either the l%- or £oo-norm, and g and h be two nested groups — that is, g C h C 
{1, . . . ,p}. Let nbe a vector in W, and let us consider the successive projections 

$*- n iui.<< g (%) and ^- n ii.ii,<**( u ifc-^)' 

with t g ,th > 0. Let us introduce v = u — If — % . The following relationships hold 

^ = n y ^(v lg + ^) and If =U u ^ th {y, h + t)- 

The previous lemma establishes the convergence in one pass of Algorithm 1 in the case where Q 
only contains two nested groups g C h, provided that If is computed before if . Let us illustrate 
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this fact more concretely. After initializing If and l^ 1 to zero, Algorithm 1 first updates If with 
the formula if <— Yl\\ ^ „<Xco g (U|g), and then performs the following update: if <— IT|| .||„<;uo A ( u |/! — If) 
(where we have used that If = & h since g C h). We are now in position to apply Lemma 2 which 

states that the primal/dual variables {y,lf ^} satisfy the optimality conditions (9), as described in 
Lemma 1. In only one pass over the groups {g,h}, we have in fact reached a solution of the dual 
formulation presented in Eq. (8), and in particular, the solution of the proximal problem (7). 

In the following proposition, this lemma is extended to general tree-structured sets of groups Q : 

Proposition 1 (Convergence in one pass) 

Suppose that the groups in Q are ordered according to the total order relation H of Definition 1, 
and that the norm \\.\\ is either the t%- or l^-norm. Then, after initializing £, to 0, a single pass of 
Algorithm I over Q with the order < yields the solution of the proximal problem (7). 

Proof The proof largely relies on Lemma 2 and proceeds by induction. By definition of Algo- 
rithm 1, the feasibility of E, is always guaranteed. We consider the following induction hypothesis 

<H (h) 4 {y g H h, it holds that If = n| N |,<^([u - L g ^htf]\ g + Z> g )}- 

Since the dual variables £, are initially equal to zero, the summation over g' <h, g' ^ g is equivalent 
to a summation over g' ^ g. We initialize the induction with the first group in Q , that, by definition 
of ^x, does not contain any other group. The first step of Algorithm 1 easily shows that the induction 
hypothesis 9{ is satisfied for this first group. 

We now assume that X (h) is true and consider the next group h' ,h<h' , in order to prove that 
X [h!) is also satisfied. We have for each group g C /j, 

? = n B . B ,^([u-^rtt+^ = n y .<^([u-jv^«'+?] IJ ). 

Since = If for g Qh! , we have 

and following the update rule for the group h', 

At this point, we can apply Lemma 2 for each group g C h, which proves that the induction hy- 
pothesis 9{{h') is true. Let us introduce v = u — Y*?eg ^f ■ We have shown that for all g in g , 
Of = Ily^foB (V|g + £f). As a result, the pair {v,^} satisfies the optimality conditions (9) of prob- 
lem (8). Therefore, after one complete pass over g € Q , the primal/dual pair {v,^} is optimal, and 
in particular, v is the solution of problem (7). ■ 

Using conic duality, we have derived a dual formulation of the proximal operator, leading to Algo- 
rithm 1 which is generic and works for any norm ||.||, as long as one is able to perform projections 
onto balls of the dual norm ||.||*. We have further shown that when |.| is the I2- or the £oo-norm, a 
single pass provides the exact solution when the groups Q are correctly ordered. We show however 
in Appendix C, that, perhaps surprisingly, the conclusions of Proposition 1 do not hold for general 
£ 9 -norms, if q ^ {1,2, 00}. Next, we give another interpretation of this result. 
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3.4 Interpretation in Terms of Composition of Proximal Operators 

In Algorithm 1, since all the vectors if are initialized to 0, when the group g is considered, we 
have by induction u — Y*h=£g^ = u — Y*h<g^- Thus, to maintain at each iteration of the inner loop 
v = u — Y,h^gt> k one can instead update v after updating If according to v <— v — If . Moreover, 
since if is no longer needed in the algorithm, and since only the entries of v indexed by g are 
updated, we can combine the two updates into v L? <— \ lg — H\\.\\ t <xa> (V| g ), leading to a simplified 
Algorithm 2 equivalent to Algorithm 1 . 

Algorithm 2 Practical Computation of the Proximal Operator for £2- or £oo-norms. 

Inputs: u € M p and an ordered tree-structured set of groups Q . 

Outputs: v (primal solution). 

Initialization: v = u. 

for g G Q , following the order -<, do 

v lg^ v IS- n i|.||,<Xa, s (V|s). 

end for 



Actually, in light of the classical relationship between proximal operator and projection (as 
discussed in Section 3.1), it is easy to show that each update V| g <— V| g — IT|| .||,<Xxd is equivalent 
to V|g <— Prox^n n [v| g ]. To simplify the notations, we define the proximal operator for a group g in 

g as Prox g (u) = Prox^n || (u^) for every vector u in W\ 

Thus, Algorithm 2 in fact performs a sequence of | Q | proximal operators, and we have shown 
the following corollary of Proposition 1 : 

Corollary 1 (Composition of Proximal Operators) 

Let g\ =4 ... ^, g m such that Q = {gi,. . . ,g m }- The proximal operator Pwx\q associated with the 
norm Q. can be written as the composition of elementary operators: 

Prox%Q = Prox 8 '" o . . . o Prox 81 . 
3.5 Efficient Implementation and Complexity 

Since Algorithm 2 involves \ g\ projections on the dual balls (respectively the £2- and the ^i-balls 
for the £2- and ^-norms) of vectors in M p , in a first approximation, its complexity is at most 0(p 2 ), 
because each of these projections can be computed in O(p) operations (Brucker, 1984; Maculan 
and Galdino de Paula, 1989). But in fact, the algorithm performs one projection for each group g 
involving \g\ variables, and the total complexity is therefore O^^geg \g\J- By noticing that if g 
and h are two groups with the same depth in the tree, then g n h = 0, it is easy to show that the 
number of variables involved in all the projections is less than or equal to dp, where d is the depth 
of the tree: 

Lemma 3 (Complexity of Algorithm 2) 

Algorithm 2 gives the solution of the primal problem Eq. (7) in O(pd) operations, where d is the 
depth of the tree. 

Lemma 3 should not suggest that the complexity is linear in p, since d could depend of p as well, 
and in the worst case the hierarchy is a chain, yielding d = p — 1 . However, in a balanced tree, 
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Algorithm 3 Fast computation of the Proximal operator for ^-norm case. 

Require: u G R p (input vector), set of groups g, ((Og) g( zg (positive weights), and go (root of the 
tree). 

1: Variables: p = (p g ) ge g in Rl^ (scaling factors); v in M p (output, primal variable). 

2: computeSqNorm(go)- 

3: recursiveScaling(go,l)- 

4: Return v (primal solution). 

Procedure computeSqNorm(g) 

1: Compute the squared norm of the group: r\ g <— ||u root ( g ) ||| + L/iechiidren(g) computeSqNorm(/i). 

2: Compute the scaling factor of the group: p g <— (l — XcOg/- v /f|^) + . 

3: Return rigp^. 
Procedure recursiveScaling(g,f) 

1: Pg^t Pg . 

2 ; v root(g) ^~ Pg u root(g)- 

3: for h G children (g) do 

4: recursiveScaling(/z,p g ). 
5: end for 



d = 0(log(p)). In practice, the structures we have considered experimentally are relatively flat, 
with a depth not exceeding d = 5, and the complexity is therefore almost linear. 

Moreover, in the case of the ^2-norm, it is actually possible to propose an algorithm with com- 
plexity O(p). Indeed, in that case each of the proximal operators Prox g is a scaling operation: 
v lg (l — ^ ro g/ll v igl|2) + V|g. The composition of these operators in Algorithm 1 thus corresponds 
to performing sequences of scaling operations. The idea behind Algorithm 3 is that the correspond- 
ing scaling factors depend only on the norms of the successive residuals of the projections and that 
these norms can be computed recursively in one pass through all nodes in 0{p) operations; finally, 
computing and applying all scalings to each entry takes then again O(p) operations. 

To formulate the algorithm, two new notations are used: for a group g in Q , we denote by root(g) 
the indices of the variables that are at the root of the subtree corresponding to g, w and by children(g) 
the set of groups that are the children of root(g) in the tree. For example, in the tree presented 
inFigure2,root({3,5,6}) = {3}, root({l,2,3,4,5,6}) = {1}, children({3,5,6}) = {{5},{6}}, and 
children({l,2,3,4,5,6}) = {{2,4},{3,5,6}}. Note that all the groups of children (g) are necessarily 
included in g. The next lemma is proved in Appendix B. 

Lemma 4 (Correctness and complexity of Algorithm 3) 

When \\.\\ is chosen to be the l^-norm, Algorithm 3 gives the solution of the primal problem Eq. (7) 
in O(p) operations. 

So far the dictionary D was fixed to be for example a wavelet basis. In the next section, we apply 
the tools we developed for solving efficiently problem (5) to learn a dictionary D adapted to our 
hierarchical sparse coding formulation. 

10. As a reminder, root(g) is not a singleton when several dictionary elements are considered per node. 
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4. Application to Dictionary Learning 

We start by briefly describing dictionary learning. 



4.1 The Dictionary Learning Framework 

Let us consider a set X = [x 1 , . . . ,x n ] in W nxn of n signals of dimension m. Dictionary learning is a 
matrix factorization problem which aims at representing these signals as linear combinations of the 
dictionary elements, that are the columns of a matrix D = [d , . . . ,d p ] in W 1ixp . More precisely, the 
dictionary D is learned along with a matrix of decomposition coefficients A = [a 1 , . . . , a"] in W xn , 
so that x ! « Da' for every signal x ! . 

While learning simultaneously D and A, one may want to encode specific prior knowledge 
about the problem at hand, such as, for example, the positivity of the decomposition (Lee and 
Seung, 1999), or the sparsity of A (Olshausen and Field, 1997; Aharon et al., 2006; Lee et al., 2007; 
Mairal et al., 2010a). This leads to penalizing or constraining (D,A) and results in the following 
formulation: 



1 



mm 



-Y -||x i -Da i '||? + X,»P(a i ' 



(10) 



where SI and © denote two convex sets and *F is a regularization term, usually a norm or a squared 
norm, whose effect is controlled by the regularization parameter A- > 0. Note that © is assumed to be 
bounded to avoid any degenerate solutions of Problem (10). For instance, the standard sparse coding 
formulation takes *P to be the £i-norm, © to be the set of matrices in W" xp whose columns have 
unit ^-norm, with A = W xn (Olshausen and Field, 1997; Lee et al., 2007; Mairal et al., 2010a). 

However, this classical setting treats each dictionary element independently from the others, and 
does not exploit possible relationships between them. To embed the dictionary in a tree structure, 
we therefore replace the ^i-norm by our hierarchical norm and set *P = Q. in Eq. (10). 

A question of interest is whether hierarchical priors are more appropriate in supervised settings 
or in the matrix-factorization context in which we use it. It is not so common in the supervised 
setting to have strong prior information that allows us to organize the features in a hierarchy. On 
the contrary, in the case of dictionary learning, since the atoms are learned, one can argue that the 
dictionary elements learned will have to match well the hierarchical prior that is imposed by the 
regularization. In other words, combining structured regularization with dictionary learning has 
precisely the advantage that the dictionary elements will self-organize to match the prior. 



4.2 Learning the Dictionary 

Optimization for dictionary learning has already been intensively studied. We choose in this paper a 
typical alternating scheme, which optimizes in turn D and A = [a 1 , . . . , a"] while keeping the other 
variable fixed (Aharon et al, 2006; Lee et al., 2007; Mairal et al., 2010a). 11 Of course, the convex 
optimization tools we develop in this paper do not change the intrinsic non-convex nature of the 
dictionary learning problem. However, they solve the underlying convex subproblems efficiently, 
which is crucial to yield good results in practice. In the next section, we report good performance 
on some applied problems, and we show empirically that our algorithm is stable and does not seem 
to get trapped in bad local minima. The main difficulty of our problem lies in the optimization of 

1 1 . Note that although we use this classical scheme for simplicity, it would also be possible to use the stochastic approach 
proposed by Mairal et al. (2010a). 



14 



Proximal Methods for Hierarchical Sparse Coding 



the vectors a', i in {1, . . . ,n}, for the dictionary D kept fixed. Because of £2, the corresponding 
convex subproblem is nonsmooth and has to be solved for each of the n signals considered. The 
optimization of the dictionary D (for A fixed), which we discuss first, is in general easier. 

Updating the dictionary D. We follow the matrix-inversion free procedure of Mairal et al. (2010a) 
to update the dictionary. This method consists in iterating block-coordinate descent over the columns 
of D. Specifically, we assume that the domain set © has the form 

® (i = {D 6 R mxp , 4* J 'h + (1 - MWWl < 1, for all j € {1,.. . ,p}}, (11) 

or £>+ = £> jU nM™ xp , with /j G [0,1]. The choice for these particular domain sets is motivated 
by the experiments of Section 5. For natural image patches, the dictionary elements are usually 
constrained to be in the unit ^2-norm ball (i.e., £> = ©o)> while for topic modeling, the dictionary 
elements are distributions of words and therefore belong to the simplex (i.e., <D = 2>j~). The update 
of each dictionary element amounts to performing a Euclidean projection, which can be computed 
efficiently (Mairal et al., 2010a). Concerning the stopping criterion, we follow the strategy from the 
same authors and go over the columns of D only a few times, typically 5 times in our experiments. 
Although we have not explored locality constraints on the dictionary elements, these have been 
shown to be particularly relevant to some applications such as patch-based image classification (Yu 
et al., 2009). Combining tree structure and locality constraints is an interesting future research. 

Updating the vectors a'. The procedure for updating the columns of A is based on the results 
derived in Section 3.3. Furthermore, positivity constraints can be added on the domain of A, by 
noticing that for our norm £2 and any vector u in ]R P , adding these constraints when computing the 
proximal operator is equivalent to solving min V dRp ^|| +AH(v). This equivalence is proved 

in Appendix B.6. We will indeed use positive decompositions to model text corpora in Section 5. 
Note that by constraining the decompositions a' to be nonnegative, some entries a'j may be set 
to zero in addition to those already zeroed out by the norm £1. As a result, the sparsity patterns 
obtained in this way might not satisfy the tree-structured condition (1) anymore. 

5. Experiments 

We next turn to the experimental validation of our hierarchical sparse coding. 
5.1 Implementation Details 

In Section 3.3, we have shown that the proximal operator associated to Q. can be computed exactly 
and efficiently. The problem is therefore amenable to fast proximal algorithms that are well suited to 
nonsmooth convex optimization. Specifically, we tried the accelerated scheme from both Nesterov 
(2007) and Beck and Teboulle (2009), and finally opted for the latter since, for a comparable level of 
precision, fewer calls of the proximal operator are required. The basic proximal scheme presented 
in Section 3.1 is formalized by Beck and Teboulle (2009) as an algorithm called ISTA; the same 
authors propose moreover an accelerated variant, FISTA, which is a similar procedure, except that 
the operator is not directly applied on the current estimate, but on an auxiliary sequence of points 
that are linear combinations of past estimates. This latter algorithm has an optimal convergence 
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rate in the class of first-order techniques, and also allows for warm restarts, which is crucial in the 
alternating scheme of dictionary learning. 12 

Finally, we monitor the convergence of the algorithm by checking the relative decrease in the 
cost function. 13 Unless otherwise specified, all the algorithms used in the following experiments 
are implemented in C/C++, with a Matlab interface. Our implementation is freely available at 
http: //www.di .ens . f r/willow/SPAMS/. 

5.2 Speed Benchmark 

To begin with, we conduct speed comparisons between our approach and other convex programming 
methods, in the setting where Q. is chosen to be a linear combination of ^-norms. The algorithms 
that take part in the following benchmark are: 

• Proximal methods, with ISTA and the accelerated FISTA methods (Beck and Teboulle, 2009). 

• A reweighted-least-square scheme (Re-^2)> as described by Jenatton et al. (2009); Kim and Xing 
(2010). This approach is adapted to the square loss, since closed-form updates can be used. 14 

• Subgradient descent, whose step size is taken to be equal either to a/(k + b) or aj{yk + b) (re- 
spectively referred to as SG and SG sqrt ), where k is the iteration number, and (a,b) are the best 15 
parameters selected on the logarithmic grid (a,b) 6 {10~ 4 , . . . , 10 3 } x {10~ 2 , . . . , 10 5 }. 

• A commercial software (Mosek, available at http://www.mosek.com/) for second-order cone 
programming (SOCP). 

Moreover, the experiments we carry out cover various settings, with notably different sparsity 
regimes, i.e., low, medium and high, respectively corresponding to about 50%, 10% and 1% of 
the total number of dictionary elements. Eventually, all reported results are obtained on a single 
core of a 3.07Ghz CPU with 8GB of memory. 

5.2. 1 Hierarchical dictionary of natural image patches 

In this first benchmark, we consider a least-squares regression problem regularized by Q. that arises 
in the context of denoising of natural image patches, as further exposed in Section 5.4. In particular, 
based on a hierarchical dictionary, we seek to reconstruct noisy 16 x 16-patches. The dictionary 
we use is represented on Figure 7. Although the problem involves a small number of variables, 
i.e., p = 151 dictionary elements, it has to be solved repeatedly for tens of thousands of patches, at 
moderate precision. It is therefore crucial to be able to solve this problem quickly and efficiently. 

We can draw several conclusions from the results of the simulations reported in Figure 3. First, 
we observe that in most cases, the accelerated proximal scheme performs better than the other 
approaches. In addition, unlike FISTA, ISTA seems to suffer in non-sparse scenarios. In the least 
sparse setting, the reweighted-i?2 scheme is the only method that competes with FISTA. It is however 
not able to yield truly sparse solutions, and would therefore need a subsequent (somewhat arbitrary) 
thresholding operation. As expected, the generic techniques such as SG and SOCP do not compete 
with dedicated algorithms. 

12. Unless otherwise specified, the initial stepsize in ISTA/FISTA is chosen as the maximum eigenvalue of the sampling 
covariance matrix divided by 100, while the growth factor in the line search is set to 1.5. 

13. We are currently investigating algorithms for computing duality gaps based on network flow optimization 
tools (Mairal et al., 2010b). 

14. The computation of the updates related to the variational formulation (6) also benefits from the hierarchical structure 
of Q , and can be performed in O(p) operations. 

15. "The best step size" is understood as being the step size leading to the smallest cost function after 500 iterations. 



16 



Proximal Methods for Hierarchical Sparse Coding 



2r 2r 2 




log(CPU time in seconds) log(CPU time in seconds) log(CPU time in seconds) 

(a) scale: small, regul.: low (b) scale: small, regul.: medium (c) scale: small, regul.: high 

Figure 3: Benchmark for solving a least-squares regression problem regularized by the hierarchical 
norm £2. The experiment is small scale, m = 256, p = 151, and shows the performances of six opti- 
mization methods (see main text for details) for three levels of regularization. The curves represent 
the relative value of the objective to the optimal value as a function of the computational time in 
second on a log 10 /log 10 scale. All reported results are obtained by averaging 5 runs. 




log(CPU time in seconds) log(CPU time in seconds) log(CPU time in seconds) 



(a) scale: large, regul.: low (b) scale: large, regul.: medium (c) scale: large, regul.: high 

Figure 4: Benchmark for solving a large-scale multi-class classification problem for four optimiza- 
tion methods (see details about the datasets and the methods in the main text). Three levels of 
regularization are considered. The curves represent the relative value of the objective to the optimal 
value as a function of the computational time in second on a log 10 /log 10 scale. In the highly regu- 
larized setting, tuning the step-size for the subgradient turned out to be difficult, which explains the 
behavior of SG in the first iterations. 



5.2.2 Multi-class classification of cancer diagnosis 

The second benchmark explores a different supervised learning setting, where / is no longer the 
square loss function. The goal is to demonstrate that our optimization tools apply in various scenar- 
ios, beyond traditional sparse approximation problems. To this end, we consider a gene expression 
dataset 16 in the context of cancer diagnosis. More precisely, we focus on a multi-class classification 
problem where the number m of samples to be classified is small compared to the number p of 

16. The dataset we use is 14 Juniors, which is freely available at http : / /www . gems -system, org/. 
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gene expressions that characterize these samples. Each atom thus corresponds to a gene expression 
across the m samples, whose class labels are recorded in the vector x in W n . 

The dataset contains m = 308 samples, p = 30017 variables and 26 classes. In addition, the 
data exhibit highly-correlated dictionary elements. Inspired by Kim and Xing (2010), we build the 
tree-structured set of groups g using Ward's hierarchical clustering (Johnson, 1967) on the gene 
expressions. The norm Q. built in this way aims at capturing the hierarchical structure of gene 
expression networks (Kim and Xing, 2010). 

Instead of the square loss function, we consider the multinomial logistic loss function that is 
better suited to deal with multi-class classification problems (see, e.g., Hastie et al., 2009). As 
a direct consequence, algorithms whose applicability crucially depends on the choice of the loss 
function / are removed from the benchmark. This is the case with reweighted-^2 schemes that do 
not have closed-form updates anymore. Importantly, the choice of the multinomial logistic loss 
function leads to an optimization problem over a matrix with dimensions p times the number of 
classes (i.e., a total of 30017 x 26 w 780000 variables). Also, due to scalability issues, generic 
interior point solvers could not be considered here. 

The results in Figure 4 highlight that the accelerated proximal scheme performs overall better 
that the two other methods. Again, it is important to note that both proximal algorithms yield sparse 
solutions, which is not the case for SG. 

5.3 Denoising with Tree-Structured Wavelets 

We demonstrate in this section how a tree-structured sparse regularization can improve classi- 
cal wavelet representation, and how our method can be used to efficiently solve the correspond- 
ing large-scale optimization problems. We consider two wavelet orthonormal bases, Haar and 
Daubechies3 (see Mallat, 1999), and choose a classical quad-tree structure on the coefficients, which 
has notably proven to be useful for image compression problems (Baraniuk, 1999). This experiment 
follows the approach of Zhao et al. (2009) who used the same tree-structured regularization in the 
case of small one-dimensional signals, and the approach of Baraniuk et al. (2010) and Huang et al. 
(2009) images where images were reconstructed from compressed sensing measurements with a 
hierarchical nonconvex penalty. 

We compare the performance for image denoising of both nonconvex and convex approaches. 
Specifically, we consider the following formulation 

min -llx — Dall? + X\|/(al = min -||D T x — all, +X\i/(a), 
aeM"'2 aeR'»2 

where D is one of the orthonormal wavelet basis mentioned above, x is the input noisy image, Da 
is the estimate of the denoised image, and \j/ is a sparsity-inducing regularization. Note that in this 
case, m = p. We first consider classical settings where \|/ is either the £i-norm — this leads to the 
wavelet soft-thresholding method of Donoho and Johnstone (1995) — or the ^o-pseudo-norm, whose 
solution can be obtained by hard-thresholding (see Mallat, 1999). Then, we consider the convex 
tree-structured regularization Q. defined as a sum of ^2-norms (^-norms), which we denote by Q.^ 
(respectively Q&J. Since the basis is here orthonormal, solving the corresponding decomposition 
problems amounts to computing a single instance of the proximal operator. As a result, when \|/ 
is Cle 2 , we use Algorithm 3 and for Algorithm 2 is applied. Finally, we consider the nonconvex 
tree-structured regularization used by Baraniuk et al. (2010) denoted here by l^, which we have 
presented in Eq. (4); the implementation details for £^ ee can be found in Appendix A. 
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Table 1: Top part of the tables: Average PSNR measured for the denoising of 12 standard im- 
ages, when the wavelets are Haar or Daubechies3 wavelets (see Mallat, 1999), for two nonconvex 
approaches (£q and ^q 66 ) and three different convex regularizations — that is, the ^i-norm, the tree- 
structured sum of ^2-norms (^c 2 ), and the tree-structured sum of ^oo-norms (£1^). Best results for 
each level of noise and each wavelet type are in bold. Bottom part of the tables: Average improve- 
ment in PSNR with respect to the £0 nonconvex method (the standard deviations are computed over 
the 12 images). CPU times (in second) averaged over all images and noise realizations are reported 
in brackets next to the names of the methods they correspond to. 



Compared to Zhao et al. (2009), the novelty of our approach is essentially to be able to solve 
efficiently and exactly large-scale instances of this problem. We use 12 classical standard test im- 
ages, 17 and generate noisy versions of them corrupted by a white Gaussian noise of variance a. For 
each image, we test several values of X = 2*0\/\ogm, with i taken in a specific range. 18 We then 
keep the parameter A. giving the best reconstruction error. The factor a^/logm is a classical heuristic 
for choosing a reasonable regularization parameter (see Mallat, 1999). We provide reconstruction 



17. These images are used in classical image denoising benchmarks. See Mairal et al. (2009b). 

18. For the convex formulations, ( ranges in {—15,-14, . . . , 15}, while in the nonconvex case i ranges in {—24, . . . ,48}. 
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results in terms of PSNR in Table 1. We report in this table the results when Q. is chosen to 
be a sum of i?2-norms or foo-norms with weights (O g all equal to one. Each experiment was run 5 
times with different noise realizations. In every setting, we observe that the tree-structured norm 
significantly outperforms the £i-norm and the nonconvex approaches. We also present a visual com- 
parison on two images on Figure 5, showing that the tree-structured norm reduces visual artefacts 
(these artefacts are better seen by zooming on a computer screen). The wavelet transforms in our 
experiments are computed with the matlabPyrTools software. 20 




(a) Lena, a = 25, i\ (b) Lena, a = 25, Sl h (c) Barb., a = 50, l\ (d) Barb., a = 50, % 



Figure 5: Visual comparison between the wavelet shrinkage model with the ^i-norm and the tree- 
structured model, on cropped versions of the images Lena and Barb.. Haar wavelets are used. 

This experiment does of course not provide state-of-the-art results for image denoising (see 
Mairal et al., 2009b, and references therein), but shows that the tree-structured regularization sig- 
nificantly improves the reconstruction quality for wavelets. In this experiment the convex set- 
ting Q.( 2 and Q.^ also outperforms the nonconvex one £^ es . 21 We also note that the speed of our 
approach makes it scalable to real-time applications. Solving the proximal problem for an image 
with m = 512 x 512 = 262144 pixels takes approximately 0.013 seconds on a single core of a 
3.07GHz CPU if Q. is a sum of ^2-norms, and 0.02 seconds when it is a sum of ^ -norms. By con- 
trast, unstructured approaches have a speed-up factor of about 7-8 with respect to the tree-structured 
methods. 

5.4 Dictionaries of Natural Image Patches 

This experiment studies whether a hierarchical structure can help dictionaries for denoising natural 
image patches, and in which noise regime the potential gain is significant. We aim at reconstructing 
corrupted patches from a test set, after having learned dictionaries on a training set of non-corrupted 
patches. Though not typical in machine learning, this setting is reasonable in the context of images, 
where lots of non-corrupted patches are easily available. 22 

19. Denoting by MSE the mean-squared-error for images whose intensities are between and 255, the PSNR is denned 
as PSNR = 101og 10 (255 2 /MSE) and is measured in dB. A gain of ldB reduces the MSE by approximately 20%. 

20. http : //www. ens . nyu . edu/~eero/steerpyr/. 

21. It is worth mentioning that comparing convex and nonconvex approaches for sparse regularization is a bit difficult. 
This conclusion holds for the classical formulation we have used, but might not hold in other settings such as Coifman 
and Donoho (1995). 

22. Note that we study the ability of the model to reconstruct independent patches, and additional work is required to 
apply our framework to a full image processing task, where patches usually overlap (Elad and Aharon, 2006; Mairal 
et al., 2009b). 
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noise 


50% 


60% 


70% 


80% 


90% 


flat 


19.3±0.1 


26.8±0.1 


36.7±0.1 


50.6 ±0.0 


72.1 ±0.0 


tree 


18.6±0.1 


25.7±0.1 


35.0±0.1 


48.0 ±0.0 


65.9 ±0.3 



Table 2: Quantitative results of the reconstruction task on natural image patches. First row: percent- 
age of missing pixels. Second and third rows: mean square error multiplied by 100, respectively for 
classical sparse coding, and tree-structured sparse coding. 




16 21 31 41 61 81 121 161 181 241 301 321 401 



Figure 6: Mean square error multiplied by 100 obtained with 13 structures with error bars, sorted 
by number of dictionary elements from 16 to 401. Red plain bars represents the tree-structured 
dictionaries. White bars correspond to the flat dictionary model containing the same number of 
dictionary as the tree-structured one. For readability purpose, the j-axis of the graph starts at 50. 

We extracted 100000 patches of size m = 8 x 8 pixels from the Berkeley segmentation database 
of natural images (Martin et al., 2001), which contains a high variability of scenes. We then split 
this dataset into a training set X rr , a validation set X vo /, and a test set X te , respectively of size 50000, 
25000, and 25000 patches. All the patches are centered and normalized to have unit ^-norm. 

For the first experiment, the dictionary D is learned on X tr using the formulation of Eq. (10), 
with jj = for as defined in Eq. (11). The validation and test sets are corrupted by removing 
a certain percentage of pixels, the task being to reconstruct the missing pixels from the known 
pixels. We thus introduce for each element x of the validation/test set, a vector x, equal to x for the 
known pixel values and otherwise. Similarly, we define D as the matrix equal to D, except for 
the rows corresponding to missing pixel values, which are set to 0. By decomposing x on D, we 
obtain a sparse code a, and the estimate of the reconstructed patch is defined as Da. Note that this 
procedure assumes that we know which pixel is missing and which is not for every element x. 

The parameters of the experiment are the regularization parameter X tr used during the training 
step, the regularization parameter \ te used during the validation/test step, and the structure of the 
tree. For every reported result, these parameters were selected by taking the ones offering the 
best performance on the validation set, before reporting any result from the test set. The values 
for the regularization parameters 'k tr ,'K te were selected on a logarithmic scale {2~ 10 ,2~ 9 , . . . ,2 2 }, 
and then further refined on a finer logarithmic scale with multiplicative increments of 2 -1 / 4 . For 
simplicity, we chose arbitrarily to use the ^-norm in the structured norm Q., with all the weights 
equal to one. We tested 21 balanced tree structures of depth 3 and 4, with different branching 
factors pi,p2,... ,Pd-i> where d is the depth of the tree and p\, k G {1, . . . ,d — 1} is the number 
of children for the nodes at depth k. The branching factors tested for the trees of depth 3 where 
pi G {5, 10,20,40,60,80, 100}, p 2 G {2,3}, and for trees of depth 4, pi G {5, 10,20,40}, p 2 G {2,3} 
and pi = 2, giving 21 possible structures associated with dictionaries with at most 401 elements. For 
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Figure 7: Learned dictionary with a tree structure of depth 5. The root of the tree is in the middle of 
the figure. The branching factors are p\ = 10, p2 = 2, p^ =2, p4 = 2. The dictionary is learned on 
50,000 patches of size 16 x 16 pixels. 

each tree structure, we evaluated the performance obtained with the tree-structured dictionary along 
with a non-structured dictionary containing the same number of elements. These experiments were 
earned out four times, each time with a different initialization, and with a different noise realization. 

Quantitative results are reported in Table 2. For all fractions of missing pixels considered, the 
tree-structured dictionary outperforms the "unstructured one", and the most significant improvement 
is obtained in the noisiest setting. Note that having more dictionary elements is worthwhile when 
using the tree structure. To study the influence of the chosen structure, we report in Figure 6 the 
results obtained with the 13 tested structures of depth 3, along with those obtained with unstructured 
dictionaries containing the same number of elements, when 90% of the pixels are missing. For 
each dictionary size, the tree-structured dictionary significantly outperforms the unstructured one. 
An example of a learned tree-structured dictionary is presented on Figure 7. Dictionary elements 
naturally organize in groups of patches, often with low frequencies near the root of the tree, and 
high frequencies near the leaves. 

5.5 Text Documents 

This last experimental section shows that our approach can also be applied to model text corpora. 
The goal of probabilistic topic models is to find a low-dimensional representation of a collection 
of documents, where the representation should provide a semantic description of the collection. 
Approaching the problem in a parametric Bayesian framework, latent Dirichlet allocation (LDA) 
Blei et al. (2003) model documents, represented as vectors of word counts, as a mixture of a prede- 
fined number of latent topics that are distributions over a fixed vocabulary. LDA is fundamentally 
a matrix factorization problem: Buntine (2002) shows that LDA can be interpreted as a Dirichlet- 
multinomial counterpart of factor analysis. The number of topics is usually small compared to the 
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size of the vocabulary (e.g., 100 against 10000), so that the topic proportions of each document 
provide a compact representation of the corpus. For instance, these new features can be used to feed 
a classifier in a subsequent classification task. We similarly use our dictionary learning approach to 
find low-dimensional representations of text corpora. 

Suppose that the signals X = [x 1 , . . . ,x"] in W" xn are each the bag-of-word representation of 
each of n documents over a vocabulary of m words, the k-th component of x' standing for the 
frequency of the k-th word in the document i. If we further assume that the entries of D and A 
are nonnegative, and that the dictionary elements d ; have unit ^i-norm, the decomposition (D,A) 
can be interpreted as the parameters of a topic-mixture model. The regularization Q. induces the 
organization of these topics on a tree, so that, if a document involves a certain topic, then all ancestral 
topics in the tree are also present in the topic decomposition. Since the hierarchy is shared by all 
documents, the topics at the top of the tree participate in every decomposition, and should therefore 
gather the lexicon which is common to all documents. Conversely, the deeper the topics in the tree, 
the more specific they should be. An extension of LDA to model topic hierarchies was proposed 
by Blei et al. (2010), who introduced a non-parametric Bayesian prior over trees of topics and 
modelled documents as convex combinations of topics selected along a path in the hierarchy. We 
plan to compare our approach with this model in future work. 
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Figure 8: Example of a topic hierarchy estimated from 1714 NIPS proceedings papers (from 1988 
through 1999). Each node corresponds to a topic whose 5 most important words are displayed. 
Single characters such as n,t, r are part of the vocabulary and often appear in NIPS papers, and their 
place in the hierarchy is semantically relevant to children topics. 



Visualization of NIPS proceedings. We qualitatively illustrate our approach on the NIPS pro- 
ceedings from 1988 through 1999 (Griffiths and Steyvers, 2004). After removing words appearing 
fewer than 10 times, the dataset is composed of 1714 articles, with a vocabulary of 8274 words. As 
explained above, we consider ©j + and take !A to be n . Figure 8 displays an example of a learned 
dictionary with 13 topics, obtained by using the £oo-norm in Q. and selecting manually X = 2~ 15 . As 
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Figure 9: Binary classification of two newsgroups: classification accuracy for different dimen- 
sionality reduction techniques coupled with a linear SVM classifier. The bars and the errors are 
respectively the mean and the standard deviation, based on 10 random splits of the dataset. Best 
seen in color. 

expected and similarly to Blei et al. (2010), we capture the stopwords at the root of the tree, and 
topics reflecting the different subdomains of the conference such as neurosciences, optimization or 
learning theory. 

Posting classification. We now consider a binary classification task of n postings from the 20 
Newsgroups data set. 2 We learn to discriminate between the postings from the two newsgroups 
alt.atheism and talk.religion.misc, following the setting of Lacoste-Julien et al. (2008) and Zhu et al. 
(2009). After removing words appearing fewer than 10 times and standard stopwords, these post- 
ings form a data set of 1425 documents over a vocabulary of 13312 words. We compare different 
dimensionality reduction techniques that we use to feed a linear SVM classifier, i.e., we consider (i) 
LDA, with the code from Blei et al. (2003), (ii) principal component analysis (PCA), (iii) nonneg- 
ative matrix factorization (NMF), (iv) standard sparse dictionary learning (denoted by SpDL) and 
(v) our sparse hierarchical approach (denoted by SpHDL). Both SpDL and SpHDL are optimized 
over ©j + and !A =]R^ X ", with the weights (Og equal to 1. We proceed as follows: given a random 
split into a training/test set of 1000/425 postings, and given a number of topics p (also the number 
of components for PCA, NMF, SpDL and SpHDL), we train an SVM classifier based on the low- 
dimensional representation of the postings. This is performed on a training set of 1 000 postings, 
where the parameters, A-G {2~ 26 , . . . ,2~ 5 } and/or C svm £ {4~ 3 , ... ,4'} are selected by 5-fold cross- 
validation. We report in Figure 9 the average classification scores on the test set of 425 postings, 
based on 10 random splits, for different number of topics. Unlike the experiment on image patches, 
we consider only complete binary trees with depths in {1, ... ,5}. The results from Figure 9 show 
that SpDL and SpHDL perform better than the other dimensionality reduction techniques on this 
task. As a baseline, the SVM classifier applied directly to the raw data (the 13312 words) obtains a 
score of 90.9±1.1, which is better than all the tested methods, but without dimensionality reduction 
(as already reported by Blei et al., 2003). Moreover, the error bars indicate that, though nonconvex, 



23. Available at http : / /people . csail .mit . edu/ jrennie/20Newsgroups/. 
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SpDL and SpHDL do not seem to suffer much from instability issues. Even if SpDL and SpHDL 
perform similarly, SpHDL has the advantage to provide a more interpretable topic mixture in terms 
of hierarchy, which standard unstructured sparse coding does not. 

6. Discussion 

We have applied hierarchical sparse coding in various settings, with fixed/learned dictionaries, and 
based on different types of data, namely, natural images and text documents. A line of research to 
pursue is to develop other optimization tools for structured norms with general overlapping groups. 
For instance, Mairal et al. (2010b) have used network flow optimization techniques for that purpose, 
and Bach (2010) submodular function optimization. This framework can also be used in the context 
of hierarchical kernel learning (Bach, 2008), where we believe that our method can be more efficient 
than existing ones. 

This work establishes a connection between dictionary learning and probabilistic topic mod- 
els, which should prove fruitful as the two lines of work have focused on different aspects of the 
same unsupervised learning problem: Our approach is based on convex optimization tools, and pro- 
vides experimentally more stable data representations. Moreover, it can be easily extended with the 
same tools to other types of structures corresponding to other norms (Jenatton et al., 2009; Jacob 
et al., 2009). It should be noted, however, that, unlike some Bayesian methods, dictionary learn- 
ing by itself does not provide mechanisms for the automatic selection of model hyper-parameters 
(such as the dictionary size or the topology of the tree). An interesting common line of research 
to pursue could be the supervised design of dictionaries, which has been proved useful in the two 
frameworks (Mairal et al., 2009a; Bradley and Bagnell, 2009; Blei and McAuliffe, 2008). 
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Appendix A. Links with Tree- Structured Nonconvex Regularization 

We present in this section an algorithm introduced by Donoho (1997) in the more general context 
of approximation from dyadic partitions (see Section 6 in Donoho, 1997). This algorithm solves the 
following problem 

min-||u-v||? + ^y 8*(v), (12) 

where the u in M p is given, X is a regularization parameter, Q is a set of tree-structured groups in 
the sense of definition 1, and the functions 8 s are defined as in Eq. (4) — that is, 8 g (v) = 1 if there 
exists j in g such that \j ^ 0, and otherwise. This problem can be viewed as a proximal operator 
for the nonconvex regularization Y*geq § g (v). As we will show, it can be solved efficiently, and in 
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fact it can be used to obtain approximate solutions of the nonconvex problem presented in Eq. (1), 
or to solve tree-structured wavelet decompositions as done by Baraniuk et al. (2010). 

We now briefly show how to derive the dynamic programming approach introduced by Donoho 
(1997). Given a group g in Q, we use the same notations root(g) and children(g) introduced in 
Section 3.5. It is relatively easy to show that finding a solution of Eq. (12) amounts to finding the 
support 5 C {1 , . . . , p} of its solution and that the problem can be equivalently rewritten 

min I||u 5 ||| + ?i (13) 

sc{l,...,p} 2 geg 

with the abusive notation 8 8 (S) = 1 if gC\S ^ and otherwise. We now introduce the quantity 

ifgnS = 

-iKootfe) ||| + A.+ Iftechiidrenfe) Vh(S) otherwise. 

After a few computations, solving Eq. (13) can be shown to be equivalent to minimizing \|%(>S) 
where go is the root of the tree. It is then easy to prove that for any group g in Q , we have 

mm =rmn(o,--||u root ( g )||l + X,-r- Y min vr h (S')), 

W...*} V 2 Aechildrenfe) 5 ^ 1 '-'^ ' 

which leads to the following dynamic programming approach presented in Algorithm 4. This al- 

Algorithm 4 Computation of the Proximal Operator for the Nonconvex Approach 

Inputs: u 6 W ', a tree-structured set of groups Q and go (root of the tree). 
Outputs: v (primal solution). 
Initialization: v <— u. 
Call recursiveThresholding(go)- 
Procedure recur siveThresholding(^) 

1: r| ^— min (0,-5 ||u root ( g )||2 + ^ + £/techildren(g) recursiveThresholding(/z)^ . 
2: if r\ = then 

3: \ g <r- 0. 

4: end if 
5: Return Tj. 



gorithm shares several conceptual links with Algorithm 2 and 3. It traverses the tree in the same 
order, has a complexity in O(p), and it can be shown that the whole procedure actually performs a 
sequence of thresholding operations on the variable v. 

Appendix B. Proofs 

We gather here the proofs of the technical results of the paper. 
B.l Proof of Lemma 1 

Proof The proof relies on tools from conic duality (Boyd and Vandenberghe, 2004). Let us intro- 
duce the cone C = {(v,z) £M p+l ; ||v|| <z} and its dual counterpart C* = {(£,t) <%}. 
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These cones induce generalized inequalities for which Lagrangian duality also applies. We refer the 
interested readers to Boyd and Vandenberghe (2004) for further details. 
We can rewrite problem (7) as 

min -||u-v||l + X £ (Q g z g , such that (y ]g ,z g ) G C, VgG Q, 
veffip,zeRlsl 2 ^ 

by introducing the primal variables z = {z g ) g eg G IK' 5 ', with the additional \g\ conic constraints 

(\ lg ,Z g ) G C, forgG Q. 

This primal problem is convex and satisfies Slater's conditions for generalized conic inequalities 
(i.e., existence of a feasible point in the interior of the domain), which implies that strong duality 
holds (Boyd and Vandenberghe, 2004). We now consider the Lagrangian L defined as 

£(v,Z,T^) = i||u-v||l + X£cO^- £ ( Zg ) (l 8 X 
1 g eg g eg V v ls/ VS / 

with the dual variables x = (x g ) g eg in K^, and £ = (ff) g eg in M px l ff l, such that for all g G Q, 
$5 = if j^gmd(^,x g )ec*. " 

The dual function is obtained by minimizing out the primal variables. To this end, we take the 
derivatives of L with respect to the primal variables v and z and set them to zero, which leads to 

v - u - ^ I 8 = and Vg G § , X(O g - l g = 0. 

After simplifying the Lagrangian and flipping (without loss of generality) the sign of $, we obtain the 
dual problem in Eq. (8). We derive the optimality conditions from the Karush-Kuhn-Tucker con- 
ditions for generalized conic inequalities (Boyd and Vandenberghe, 2004). We have that {v,z,x,^} 
are optimal if and only if 

Vg G Q ,z g l g — vji^ = 0, (Complementary slackness) 

VgG g,(y ]g ,z g ) g c, VgG£,taflg-x g =o, 

VgG2,(^,X g )GC*, v-u + L^ =0. 
Combining the complementary slackness with the definition of the dual norm, we have 

VgG g, z^ = vJS'< 

Furthermore, using the fact that Vg G g, (V| g ,z g ) G C and ,x g ) = (t, 8 ,X(O g ) G C* , we obtain the 
following chain of inequalities 

VgG g, Xz g (o g = yJ g t, 8 < ||v|g||||^||* <zj£*||* <te g ® g , 

for which equality must hold. In particular, we have = ||V|g||||£ g ||* and z s ||^||* = kz g ® g - 
If V| g ^ 0, then z g cannot be equal to zero, which implies in turn that ||^||* = X(O g . Eventually, 
applying Lemma 5 gives the advertised optimality conditions. 

Conversely, starting from the optimality conditions of Lemma 1, and making use again of 
Lemma 5, we can derive the Karush-Kuhn-Tucker conditions displayed above. More precisely, 
we define for all g G g, 

T g = XcDg and z. g = ||V| g ||. 
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The only condition that needs to be discussed is the complementary slackness condition. If = 0, 
then it is easily satisfied. Otherwise, combining the definitions of x g , z g and the fact that 

yj g % s '= ||V|J||^||* and ||^||* = X(i) g , 

we end up with the desired complementary slackness. ■ 



B.2 Optimality condition for the projection on the dual ball 

Lemma 5 (Projection on the dual ball) 

Let w G W and t > 0. We have K = II||n t <j(w) if and only if 

'7||w||* <t, K = W, 

otherwise, ||k||*=7 and K T (w — k) = ||k||*||w — k||. 

Proof When the vector w is already in the ball of ||.||* with radius t, i.e., ||w||* < t, the situation 
is simple, since the projection Uu iL< f (w) obviously gives w itself. On the other hand, a necessary 
and sufficient optimality condition for having K = XT[| ,|| t < f (w) = argmiri| y | <f ||w — y 1 1 2 is that the 
residual w — K lies in the normal cone of the constraint set (Borwein and Lewis, 2006), that is, for 
all y such that ||y||* < t, (w — K) T (y — k) < 0. The displayed result then follows from the definition 
of the dual norm, namely ||k||* =max|| z ||< 1 z T K. ■ 



B.3 Proof of Lemma 2 

Proof First, notice that the conclusion £, = ilii .|L<x.a> ft ( v |A + simply comes from the definition 
of if and v, along with the fact that if = & h since g Ch. We now examine If . 

The proof mostly relies on the optimality conditions characterizing the projection onto a ball of 
the dual norm || • ||*. Precisely, by Lemma 5, we need to show that either 

^ = %-^ if 11% -^11* <**, 

or 

\\m*=h and ^ T ( % -^-^) = ||^||,|| % -^-^||. 

Note that the feasibility of if, i.e., ||^||* < t g , holds by definition of K^. 

Let us first assume that \\£f\\* < t g . We necessarily have that U| g also lies in the interior of 
the ball of ||.||* with radius t g , and it holds that if = u )g . Since g C h, we have that the vector 
U|/ 3 — if= U|/, — U| g has only zero entries on g. As a result, t, g = (or equivalently, ^ = 0) and we 
obtain 

which is the desired conclusion. From now on, we assume that ||^ g ||* = t g . It then remains to show 
that 

^K-ft-?) = 11^11.11^-^-^11. 
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We now distinguish two cases, according to the norm used. 

^2-norm: As a consequence of Lemma 5, the optimality condition reduces to the conditions for 
equality in the Cauchy-Schwartz inequality, i.e., when the vectors have same signs and are linearly 
dependent. Applying these conditions to individual projections we get that there exists p ? ,p/, > 
such that 

p g ^ = u lg -^ and p^=up,-if-lj\ (14) 

Note that the case p/, = leads to up, — if — = 0, and therefore U| g — if — if^ g = since g C h, 
which directly yields the result. The case p g = implies U| g — if = and therefore ^ = 0, yielding 
the result as well. Now, we can therefore assume pf, > and p g > 0. From the first equality of (14), 
we have if = if g since (p g + l)if = u^. Further using the fact that g C h in the second equality of 
(14), we obtain 

( P/i +l)^= % -^ = p^. 
This implies that u !g — if — i^ g = p g if — p^+j^, which eventually leads to 

VgVh 

The desired conclusion follows £ gT (u| g — if — i^ g ) = ||^ s ||2||U| g - if - tfgh- 

^oo-norm: In this case, the optimality corresponds to the conditions for equality in the £cc-£i 
Holder inequality. Specifically, if = iT||.|| t <^(u|g) holds if and only if for all 7^ 0,7 G g, we have 

u y-^y = 11% -^ g H~ signup- 
Looking at the same condition for i, , we have that if = Ily^^ (up, — i, g ) holds if and only if for 
all £y 7^ 0,7 G /z, we have 

u,-^-^=||u |ft -^-eiUsign(^). 

From those relationships we notably deduce that for all 7 G g such that ^y ^ 0, sign(^y) = sign(uy) = 

sign(^y) = sign(u ; - — if) = sign(u 7 - - ^y — iff). Let 7 G g such that ^y / 0. At this point, using the 
equalities we have just presented, 

In _P*_E*I = J K~^H~ if^'=0 
1 3 V W \ || U|A -^-eil~ if^VO. 

Since ||U| g — ^ g ||oo > ||U| g — if — ^||oc (which can be shown using the sign equalities above), and 
1 1 u |/ z - if - ^ /l ||oo > ||U| g - If - ^||oo (since g C ft), we have 



u, 



■?-§*iu>iu y -^-§*i>iK-?-4m 



and therefore for all If- ^ 0, j G g, we have u y — ^y — £j = ||u| g — if — ^| g ||ooSign(^y), which yields 
the result. ■ 
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B.4 Proof of Lemma 4 

Proof Notice first that the procedure computeSqNorm is called exactly once for each group g in Q , 
computing a set of scalars (p g ) g eg in an order which is compatible with the convergence in one 
pass of Algorithm 1 — that is, the children of a node are processed prior to the node itself. Following 
such an order, the update of the group g in the original Algorithm 1 computes the variable if which 
updates implicitly the primal variable as follows 



(1 . X(0 z 



It is now possible to show by induction that for all group g in Q, after a call to the procedure 
computeSqNorm(g), the auxiliary variable r\ g takes the value 1 1 v !s . 1 1 ^ where v has the same value as 
during the iteration g of Algorithm 1. Therefore, after calling the procedure computeSqNorm(go), 
where go is the root of the tree, the values p g correspond to the successive scaling factors of the 
variable \ lg obtained during the execution of Algorithm 1. After having computed all the scaling 
factors p g , g G Q , the procedure recursiveScaling ensures that each variable j in {1, ... ,p} is 
scaled by the product of all the pi„ where h is an ancestor of the variable j. 

The complexity of the algorithm is easy to characterize: Each procedure computeSqNorm and 
recursiveScaling is called p times, each call for a group g has a constant number of operations 
plus as many operations as the number of children of p. Since each child can be called at most one 
time, the total number of operation of the algorithm is O(p). ■ 



B.5 Sign conservation by projection 

The next lemma specifies a property for projections when |.| is further assumed to be a 4/-norm 
(with q > 1). We recall that in that case, ||.||* is simply the ^/-norm, with q' = (1 — l/q) . 

Lemma 6 (Projection on the dual ball and sign property) 

Let w G W and t > 0. Let us assume that ||.|| is a £ q -norm (with q > I). Consider also a diagonal 
matrix S G WL pxp whose diagonal entries are in {—1, 1}. We have Tin .|L<r(w) = SITn ir < £ (Sw). 

Proof Let us consider K = Tin u „< ? (w). Using essentially the same argument as in the proof of 
Lemma 5, we have for all y such that \\y\\ q ' < t, (w — K) T (y — k) < 0. Noticing that S T S = I and 
\\y\\ q i = \\Sy\\ q >, we further obtain (Sw — Sk) t (v' — Sk) < for all y' with \\y'\\ q ' < t. This implies in 
turn that SILi iL< f (w) = ILi iL< f (Sw), which is equivalent to the advertised conclusion. ■ 

Based on this lemma, note that we can assume without loss of generality that the vector we want to 
project (in this case, w) has only nonnegative entries. Indeed, it is sufficient to store beforehand the 
signs of that vector, compute the projection of the vector with nonnegative entries, and assign the 
stored signs to the result of the projection. 

B.6 Non-negativity constraint for the proximal operator 

The next lemma shows how we can easily add a non-negativity constraint on the proximal operator 
when the norm Q. is absolute (Stewart and Sun, 1990, Definition 1.2), that is, a norm for which the 
relation £2(u) < £2(w) holds for any two vectors w and uGf such that |u ; | < |w,-| for all j. 
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Lemma 7 (Non-negativity constraint for the proximal operator) 

Let K G W J and X > 0. Consider an absolute norm Q.. We have 



argmm 

zeW 



2 + a£2(z) = argmin -||k— z||f + XQ(z 



zelR'X 



2' 



(15) 



Proof Let us denote by z + and z the unique solutions of the left- and right-hand side of (15) 
respectively. Consider the normal cone (z ) of at the point z (Borwein and Lewis, 2006) 
and decompose K into its positive and negative parts, K = [k]+ + [k]_. We can now write down 
the optimality conditions for the two convex problems above (Borwein and Lewis, 2006): z + is 
optimal if and only if there exists w G 3£2(z + ) such that z + — [k]+ + A.w = 0. Similarly, z is optimal 
if and only if there exists (s,u) G 3£2(z) x 3\£ k p(z) such that z — k + as + u = 0. We now prove 
that [k]_ = K— [k] + belongs to 9{^_p (z + ). We proceed by contradiction. Let us assume that there 
exists z G MP + such that [k]I(z — z + ) > 0. This implies that there exists j G {!,...,/>} for which 



K 



./J 



< and Zj — zt < 0. In other words, we have < zj = zj — [k/]+ < zt = zt — [K/J+. With 
the assumption made on £2 and replacing zt by zj, we have found a solution to the left-hand side 
of (15) with a stricly smaller cost function than the one evaluated at z + , hence the contradiction. 
Putting the pieces together, we now have 

z+ - [k]+ + aav = z+ -k + Xw+ [k]_ = 0, with (w, [k]_) G 3n(z+) x 5V; M /|(z + ), 

which shows that z + is the solution of the right-hand side of (15). ■ 



Appendix C. Counterexample for ^-norms, with q ^ {1,2,°°}. 

The result we have proved in Proposition 1 in the specific setting where |.| is the I2- or £oo-norm 
does not hold more generally for ^-norms, when q is not in {1,2, 00}. Let q > 1 satisfying this 

condition. We denote by q' = (1 — q^ 1 ) _1 the norm parameter dual to q. We keep the same notation 
as in Lemma 2 and assume from now on that ||U| g || ? ' >? g and ||U|/,|| g ' >t g +t/ t . These two inequalities 
guarantee that the vectors U| g and — do not lie in the interior of the ^'-norm balls, of respective 
radius t g and t%. 

We show in this section that there exists a setting for which the conclusion of Lemma 2 does not 
hold anymore. We first focus on a necessary condition of Lemma 2: 

Lemma 8 (Necessary condition of Lemma 2) 

Let |.| be a i q -norm, with q ^ {1,2, 00}. If the conclusion of Lemma 2 holds, then the vectors ^ 
and ^ are linearly dependent. 

Proof According to our assumptions on and — we have that ||^|| 9 ' = t g and ||^*||^ = t%. 
In this case, we can apply the second optimality conditions of Lemma 5, which states that equality 
holds in the £ q -£ q > Holder inequality. As a result, there exists p g , p/, > such that for all j in g: 

\^/ = p g \uj-^ and |^K=p,|u,-^-^. (16) 

If the conclusion of Lemma 2 holds — that is, we have £f = ITy^ (u^ — ^), notice that it is not 
possible to have the following scenarios, as proved below by contradiction: 
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• If ||U| g — Z^gWq 1 < ts>, then we would have = u )g — ^' g , which is impossible since = t g . 

• If ||u| g — ^ g \\q' = t g , then we would have for all j in g, \^j\ q ' = p/Juy — ^ — ^\ q = 0, which 
implies that ^ = and ||U|g|| 9 / = t g . This is impossible since we assumed Hu^H^ > t g . 

We therefore have ||U| g — ^ g \\ q ' > tg and using again the second optimality conditions of Lemma 5, 

there exists p > such that for all j in g, |^yK = p|u ; - — ^ — ^!)\ q - Combined with the previous 

relation on E , we obtain for all j in g, |^|? = Since we can assume without loss of 

generality that u only has nonnegative entries (see Lemma 6), the vectors t, 8 and ^' can also be 
assumed to have nonnegative entries, hence the desired conclusion. ■ 

We need another intuitive property of the projection ILi ii < t to derive our counterexample: 
Lemma 9 (Order-preservation by projection) 

Let \\.\\ be a l q -norm, with q ^ {1,°°} and q' = 1/(1 — q~ l ). Let us consider the vectors K,w E W 
such that K = II|| .||„<f(w) = argmin|| y || /<t ||y — w||2, with the radius t satisfying ||w|| 9 / > t. If we 

have w,- < v/jfor some (i,j) in {1, . . . ,p} 2 , then it also holds that K,- < K/. 

Proof Let us first notice that given the assumption on t, we have ||k||^ = t. The Lagrangian L 
associated with the convex minimization problem underlying the definition of TTii n <t can be written 

as 

L (y,oc) = — ||y — w||2 + oc[||y||^ — t q ] , with the Lagrangian parameter a > 0. 

At optimality, the stationarity condition for K leads to 

V j E {1, . . . ,p}, Kj - wj + aql*/- 1 = 0. 

We can assume without loss of generality that w only has nonnegative entries (see Lemma 6). Since 
the components of K and w have the same signs (see Lemma 6), we therefore have |K;| = Kj > 0, 
for all j in {1, . . . ,p}. Note that a cannot be equal to zero because of ||k||^ = t < ||w|| g '. 

Let us consider the continuously differentiable function cp w : K i— > K — w + aq'K q ~ 1 defined on 
(0,°°). Since cp»,(0) = — w < 0, lim K ^ooCp H ,(K) = oo and cp v „ is strictly nondecreasing, there exists a 
unique K* > such that cp v ,.(K*,) = 0. If we now take w < v, we have 

q>v«) = <PwK) + w- v = w- v<0 = cp v «). 

With q>,, being strictly nondecreasing, we thus obtain K*. < K*. The desired conclusion stems from 
the application of the previous result to the stationarity condition of K. ■ 

Based on the two previous lemmas, we are now in position to present our counterexample: 
Proposition 2 (Counterexample) 

Let ||.|| be a i q -norm, with q ^ {l,2,oo} and q' = 1/(1 — q~ l ). Let us consider Q = {g,h}, with 
g C h C {1,. . . ,p} and \g\ > 1. Let u be a vector in W 7 that has at least two different nonzero 
entries in g, i.e., there exists (i,j) in g x g such that < |u,| < |iiy|. Let us consider the successive 
projections 

^= n i|.||»<J%) and ^ = U \\.\U<t,X U \h-Z= S ) 
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with t g ,tf, > satisfying \\u\g\\qi > t g and \\u\h\\q' > t g +th- Then, the conclusion of Lemma 2 does 
not hold. 

Proof We apply the same rationale as in the proof of Lemma 9. Writing the stationarity conditions 
for %f and £ , we have for all j in g 

tf + a q >(^'- l -Uj = 0, and $ + W<$)*- 1 - (uj-£°) =0, (17) 

with Lagrangian parameters a, (3 > 0. We now proceed by contradiction and assume that £, 8 = 
iT|| || t <^(U| g — ^). According to Lemma 8, there exists p > such that for all j in g, ^ = p^. If 

we combine the previous relations on and we obtain for all j in g, 

j j p 

If C < 0, then we have a contradiction, since the entries of £, 8 and U| g have the same signs. Similarly, 
the case C = leads a contradiction, since we would have U| g = and > t g . As a conse- 

quence, it follows that C > and for all j in g, ^ = exp { '° g ^ } , which means that all the entries 
of the vector ^ are identical. Using Lemma 9, since there exists (i, j) £ g x g such that u,- < u ; , we 
also have £f < ^, which leads to a contradiction. ■ 
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